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Abstract 

Background: A good scoring function is essential for molecular docking computations. In conventional scoring 
functions, energy terms modeling pairwise interactions are cumulatively summed, and the best docking solution is 
selected. Here, we propose to transform protein-ligand interactions into three-dimensional geometric networks, from 
which recurring network substructures, or network motifs, are selected and used to provide probability-ranked 
interaction templates with which to score docking solutions. 

Results: A novel scoring function for protein-ligand docking, MotifScore, was developed. It is non-energy-based, and 
docking is, instead, scored by counting the occurrences of motifs of protein-ligand interaction networks constructed 
using structures of protein-ligand complexes. MotifScore has been tested on a benchmark set established by others to 
assess its ability to identify near-native complex conformations among a set of decoys. In this benchmark test, 84% of 
the highest-scored docking conformations had root-mean-square deviations (rmsds) below 2.0 A from the native 
conformation, which is comparable with the best of several energy-based docking scoring functions. Many of the top 
motifs, which comprise a multitude of chemical groups that interact simultaneously and make a highly significant 
contribution to MotifScore, capture recurrent interacting patterns beyond pairwise interactions. 

Conclusions: While providing quite good docking scores, MotifScore is quite different from conventional energy-based 
functions. MotifScore thus represents a new, network-based approach for exploring problems associated with 
molecular docking. 



Background 

Drug design and discovery play a pivotal role in driving 
research in computational chemistry and biology [1-6]. In 
computational drug design and discovery, it is often nec- 
essary to determine, as a first step, the binding of a ligand 
to a target protein. The computational scheme for pre- 
dicting ligand binding occurrence, affinity, and orienta- 
tion is commonly referred to as "molecular docking", 
which has been a topic of intensive research for decades 
[6,7]. The development of a molecular docking tool usu- 
ally starts with an efficient search algorithm, which places 
the ligand in the active site of the target protein in numer- 
ous different positions, orientations, and, in flexible 
docking, conformations. These are then evaluated by a 
scoring function to distinguish between good (near- 
native) and bad (decoy) docking solutions. The two 
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aspects of searching and scoring can be, and usually have 
been, developed and evaluated separately, although one 
clearly affects the other and a balance is often sought to 
meet specific study aims [6,8,9]. 

Scoring functions for molecular docking are tradition- 
ally either physics-based or knowledge-based [6,9] and 
differ mainly in the derivation of the mathematical mod- 
els used to compute the energies of molecular interac- 
tions. Physics-based scoring functions employ a set of 
molecular interaction terms to compute binding energies. 
For example, the scoring function used in G-Score [10,11] 
or AutoDock 3.0 [12] is based on the molecular mechan- 
ics force field used in Tripos [13] or Amber [14], respec- 
tively, and F-Score [15] and ChemScore [16] derive the 
coefficients of their energy terms using regression analy- 
sis of experimentally determined binding energies [17]. In 
contrast, knowledge-based scoring functions, such as 
PMF [18] and DrugScore [19,20], rely on statistical obser- 
vations of preferred protein-ligand contacts, from which 
binding energies are calculated. 
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A common thread in these scoring functions is that, 
with few exceptions, they all operate on the assumption 
that the total interaction will be faithfully represented by 
an additive summation over a series of pairwise interac- 
tions between interaction centers, which are either real 
(such as individual atoms) or imaginary (such as the geo- 
metric center of a group of atoms). Despite fundamental 
fallacies in this assumption, it has been adopted for many 
years, as otherwise the problem would be intractable or 
too computationally expensive to handle practically 
[21,22]. 

In this study, we investigated a new idea for developing 
a scoring function for molecular docking. The novelty of 
the idea lies in the use of a network approach to identify 
frequently occurring patterns of protein-ligand interac- 
tions not in the form of pairwise interactions, but in the 
form of network motifs, where "network" refers to a col- 
lection of pairwise interactions. The motifs, which are 
assemblies of multiple pairwise interactions between pro- 
teins and ligands that are frequently observed in a data- 
base of known protein-ligand complex structures, thus 
represent energetically favorable ways of positioning a 
ligand molecule in the active site of a protein. Such a 
motif-based approach offers a new line of exploration for 
research on molecular docking. As a first step in this line 
of research, we present the construction of a motif-based 
scoring function, called MotifScore, and the evaluation of 
its ability to distinguish between good and bad docking 
solutions. Our results using a benchmark dataset showed 
that MotifScore performed well against a number of scor- 
ing functions used in existing docking programs. 

Methods 

In constructing the protein-ligand interaction network, 
the three-dimensional (3D) coordinates of a set of pro- 
tein-ligand complexes were each transformed into an 
atom-atom interaction network. There are two types of 
nodes in the network, namely atoms (interacting centers) 
from the protein and the ligand. Two nodes (one from the 
protein and the other from the ligand) are connected by 
an edge if their interaction is deemed significant by a dis- 
tance threshold determined from a statistical analysis of a 
dataset of protein-ligand complexes (see below). Interac- 
tions between atoms within the protein or within the 
ligand are not considered. These networks were then bro- 
ken down into many interaction motifs representing sim- 
ple units of the network with specific protein-ligand 
interaction patterns. 

Datasets 

To establish distance thresholds for constructing the pro- 
tein-ligand interaction network, we performed a large- 
scale statistical analysis on a diverse set of known struc- 
tures of protein-ligand complexes. The available protein- 



ligand complex structures as of January 10, 2006 in the 
Protein Data Bank (PDB) [23] were screened. To simplify 
our task, only those determined by X-ray crystallography 
and with only one ligand molecule were selected. Fur- 
thermore, those containing DNA/RNA molecules and 
those in which the ligand binds to its target protein cova- 
lently were discarded. Complexes containing a heme 
group were also excluded because one may consider the 
heme group as a part of the protein, rather than a ligand. 
Solvent molecules and ions, such as chloride or ammo- 
nium, that are often included to facilitate crystallographic 
studies were not considered as ligands, whereas metal 
ions such as zinc or calcium ion that are coordinated by 
metal-binding amino acids were considered as an exten- 
sion of the protein molecule, and their interactions with 
ligand were considered part of the protein-ligand interac- 
tion network constructed. In all, we obtained a total of 
6,276 structures of complexes that could be used in the 
statistical analysis. 

This set of PDB protein-ligand complex structures is 
not entirely non-redundant since a single protein or its 
mutants may be co-crystallized with many similar com- 
pounds, and, conversely, the same compound may be co- 
crystallized with multiple homologous proteins. How- 
ever, unless both the protein and the ligand are com- 
pletely identical, part of the resulting interaction 
networks can still be distinct, and as defining non-redun- 
dant networks is not a straightforward task, we chose to 
include as many network connectivities (i.e. protein- 
ligand interactions) as possible while conducting normal- 
izations to minimize the potential bias of using a redun- 
dant dataset. Furthermore, analysis on a reduced set 
(4,190 structures) by excluding those that share >25% 
protein sequence identity and identical ligand showed 
that the resulting network motifs and their grades 
(defined below) had a Pearson's correlation coefficient of 
0.95 with those of the set of 6,276 structures (data not 
shown), suggesting that including homologous proteins 
and identical ligand compounds would not significantly 
affect the outcome of our analysis. 

Besides the set of 6,276 structures, we used a subset of 
the LPDB (Ligand-Protein DataBase) [24], which con- 
tains decoys generated from molecular dynamics simula- 
tions [8], to optimize the parameters of our scoring 
function, MotifScore. There are more than 200 com- 
plexes, each with a set of various numbers of decoys, in 
the LPDB, but only 113 of these were used in this work 
because we found that the ligand names, atom names, 
and the ordering of atoms of many of the decoys in LPDB 
were not the same as in the original data files in the PDB, 
making it difficult for us to correctly assign their desig- 
nated atom types (see below). For testing and evaluating 
MotifScore, the data set of Wang et al. [9] was used. The 
Wang dataset consists of a set of 100 protein-ligand com- 
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plexes, each of which comes with 100 docked conforma- 
tions (i.e. decoys) generated using AUTODOCK 3.0 [12]. 
Since Wang et al. have used this dataset as a benchmark 
to evaluate a number of docking scoring functions [9,20], 
we used it to compare the performance of MotifScore 
with that of a number of other scoring functions. 

Atom type assignment 

Based on the work of others [18,25] and the chemical 
properties of simple molecules, we created an atom type 
classification scheme to describe protein-ligand interac- 
tions. We defined a total of 23 atom types (Table 1), of 
which 14 were for protein atoms and 20 for ligand atoms, 
with many of them shared by both. The atom types were 
identified by a 3-code name, or a 3-letter name for those 
that did not need to be further classified, as they were rel- 
atively rarely observed in our dataset of protein-ligand 
complexes (e.g., metals, phosphorus, and halogens). 
Some general rules for the 3-code names were as follows. 



The 1 st code was the name of the element (C, N, O, and S) 
and the 2 nd and 3 rd code indicated the surroundings and 
electrostatic properties of the atom. The 2 nd code could 
be 2, 3, R, or L, which, respectively, correspond to sp2 or 
sp3 hybridization or inclusion in an aromatic ring or an 
aliphatic chain. The 3 rd code could be P, N, A, D, B, E, or 
C, which, respectively, correspond to polar, non-polar, 
hydrogen bond acceptor, hydrogen bond donor, an atom 
that can be both a hydrogen bond donor and acceptor or 
either a hydrogen bond donor or acceptor, or a charged 
atom. The 'either (E)' code was associated with the atom 
type NRE, which was used primarily for the two nitrogen 
atoms on the imidazole ring of a histidine, as both of the 
nitrogens can be either protonated (a hydrogen bond 
donor) or deprotonated (a hydrogen bond acceptor). For 
simplicity, the nitrogen of tryptophan, which is much 
infrequently seen than histidine especially in the active 
site, is also assigned the atom type NRE. 



Table 1 : Atom types and descriptions 



Atom type 


Description 


Exampleb 


On protein 


On ligand 


C2N 


C, SP2, normal/non-polar 


-C=C- 




X 


C2P 


C, SP2, polar 


-c=o 


X 


X 


C3N 


C, SP3, normal/non-polar 


-c-c- 


X 


X 


C3P 


CSP3, polar 


-C-OH 


X 


X 


CRN 


C, aromatic, normal/non-polar 


a benzene carbon 


X 


X 


CRP 


C, aromatic, polar 


a halogenated aromatic carbon 


X 


X 


02A 


O, SP2, hydrogen bond acceptor 


>C=0 


X 


X 


03A 


O, SP3, hydrogen bond acceptor 


C-O-C 




X 


03B 


O, SP3, both 


-OH 


X 


X 


OLC 


0, aliphatic, charged 


-COO' 


X 




ORA 


O, aromatic, hydrogen bond acceptor 


oxygen in furan 




X 


NLA 


N, aliphatic, hydrogen bond acceptor 


-NR 2 


X 


X 


NLB 


N, aliphatic, hydrogen bond donor 


-NH lor2 


X 


X 


NLC 


N, aliphatic, charged 


-NH 3 + 




X 


NRA 


N, aromatic, hydrogen bond acceptor 


a non-protonated aromatic nitrogen 




X 


NRD 


N, aromatic, hydrogen bond donor 


a protonated aromatic nitrogen 


X 


X 


NRE 


N, aromatic, either H.B. donor or 
acceptor 


nitrogen (protonated or not) of imidazole 


X 




S3N 


S, SP2, normal/non-polar 


-SH 


X 


X 


SRA 


S, aromatic, hydrogen bond acceptor 


sulfur in an aromatic ring 




X 


SLC 


S, aliphatic, charged 


-so 4 -2 




X 


PHO 


Phosphorus 






X 


MET 


Metal 3 




X 




HAL 


Halogen 


F, CI, Br, 1 




X 



a Metal ions, such as Zn, were considered in this work as an extended interaction site of the protein molecule. 



b See Fig. S3 in Additional file 1 for a representative chemical structure of the atom type CRN, CRP, ORA, NRA, NRD, NRE, and SRA. 
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In our scheme, each individual atom is itself an inter- 
acting center. For example, the chemical group P0 3 
would have four interacting centers: namely, one PHO 
(phosphorus) and three OLC (charged oxygen). Others 
have used pseudoatoms or pseudocenters (e.g., [26]), 
such as the geometric center of a chemical group, which 
may have the advantage of the resulting scoring function 
being less sensitive to the exact interaction distances 
used. On the other hand, chemical group centers do not 
fully account for the complex interactions, including the 
different interacting orientations, for example, of the con- 
stituent atoms. The interaction network motifs derived 
here represent a hybrid of both, since interactions are 
computed on real atoms but the scoring function is a 
non-energy-based network motif count, which does not 
rely on a precise range of interaction distances (see 
below). The interaction networks could also be con- 
structed using pseudoatoms instead, but whether this 
would yield better results remains to be investigated. 

For an automated atom type assignment, for protein 
molecules, we relied on standard PDB files, which have 
conventional names and thus the implicit bonding infor- 
mation for all protein atoms, and, for ligand molecules, 
we relied on the het dictionary file on the PDB website 
[23], which contains the complete naming and bonding 
information of all the ligand molecules that appear in the 
PDB files. 

Determination of atom-atom interaction thresholds 

Whether or not there is an interaction between two 
atoms is determined by the energy produced by the inter- 
action, which is affected by many variables, but mostly 
the distance between the two atoms. To save computa- 
tional time, most studies have used distance instead of 
energy as the criterion to determine whether two atoms 
interact. The threshold of distance for interaction could 
greatly affect the complexity and outcome of a study, but 
most knowledge-based docking studies seem to suggest 
that a cutoff in the range of 4-6 A between heavy atoms 
can achieve optimal performance [25]. However, a single 
cutoff may not be sufficient for the present study, 
because: 1) whereas conventional methods use a distance 
cutoff to avoid the computation of the less significant 
portion of interaction energies, we used the distance cut- 
off merely to establish network edges, which, unlike con- 
ventional scoring functions, do not involve a distance- 
dependent energy computation; 2) in the absence of a dis- 
tance-dependent energy function, clashes between two 
interacting atoms cannot be prevented. In order to appro- 
priately define atom-atom interactions (network edges), 
we therefore introduced not only an upper distance 
threshold for any pair of interacting atoms, but also a 
lower one. As described below, the values of these thresh- 
olds were determined by examining the distance distribu- 



tions of all protein-ligand atom pairs (a total of 14 protein 
atom types x 20 ligand atom types, i.e. 280) in the 6,276 
protein-ligand complex structures selected from the PDB. 

Following the concept of radial distribution [27], we 
analyzed the occurrence distribution of all atom type 
pairs as a function of separation distance. For atom type 
x, in order to determine the distance interval at which the 
highest density of atom type pair x-y occurs (hence the 
preferred interacting distance interval between x and y), 
we divided the occurrence of y by the difference between 
the cubes of the outer and inner radial distances from 
atom x (for example, 3 3 - 2 3 , when y is found between 2 
and 3 A from x, which is proportional to the volume of 
the shell within which y atoms are found). For normaliza- 
tion, we also divided the occurrence of y atoms found in 
this particular distance interval by the total occurrence of 
y atoms in the 6,276 complex structures in the PDB. The 
resulting distributions were quite distinctive from one 
atom type pair to another. However, to a large extent, 
they could all be characterized, albeit approximately, as 
one of four categories, as summarized in Fig. 1. 

The first category (Fig. la) featured a sharp peak at a 
small distance, indicating a specific distance or a narrow 
range of distances for contact (i.e. interaction) between 
the two atom types. More than one-third of the atom type 
pairs (86 out of 280) were classified in this category (see 
Table SI in Additional file 1). All of these are known to 
exhibit specific interactions, such as hydrogen bonding, 
salt bridge, and polar-polar interactions. Examples of this 
category include hydrogen bonding pairs, such as NLB- 
02A and NLB-OLC, which represent a hydrogen bond- 
ing interaction between a primary or secondary amine 
(NLB) and a double-bonded oxygen (02A) or charged 
oxygen (OLC). 

For the second category (Fig. lb), the distribution curve 
increased rapidly over a short span of distance, then 
stayed fairly flat or decreased slowly over a wide distance 
range. This type of distribution is indicative of the 
absence of a strongly preferred distance for the two atom 
types to contact each other. About half of the atom type 
pairs (137/280) fell into this category. Most were for weak 
interactions (80/137), such as C3N-C3N (two non-polar 
sp3 carbon atoms) and CRN-CRN (two non-polar aro- 
matic carbons), or for less well-defined interactions (57/ 
137) involving sulfuric or aromatic atom types, such as 
S3N-CRP (a sp3 sulfur atom that is usually non-polar, but 
may be deprotonated on rare occasions, interacting with 
a polar aromatic carbon) and NRD-CRN (a hydrogen 
bond donor nitrogen interacting with a non-polar aro- 
matic carbon). 

The distribution curve of the third category (Fig. lc) 
also began to rise after a short distance, but, unlike the 
first two categories, rose quite slowly over a large dis- 
tance. This indicates that the atom type pairs belonging 
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Figure 1 Examples of the four categories of normalized distributions of protein-ligand interatomic distances. The four represent strong in- 
teractions, weak interactions, non-specific interactions, and rare interactions (see text). Each is represented by a specific example, namely: a) a hydro- 
gen bonding contact in the interaction of NRD with OLC; b) a van der Waals contact in the interaction of C2N with CRN; c) a non-specific contact in 
the interaction of C3N with C2P; and d) a rarely seen contact in the interaction of NLC with S3N.The red arrows mark the upper and lower distance 
cutoffs used to connect atoms in constructing the protein-ligand interaction networks: those separated by a distance within the upper and the lower 
threshold would be connected, while no connection was made for the non-specific interactions in the third category. 



to this category rarely see each other at short distances. 
The few occurrences observed at short distances are 
likely to be a result of an attraction owing to adjacent 
atoms that are chemical-bonded to the atom in question. 
Thirty-eight atom type pairs were classified into this cate- 
gory, which typically exhibited an interaction between a 
non-polar atom and a polar atom or between two polar 
atom types with the same polarity, such as C2N-C2P (a 
non-polar sp2 carbon interacting with a polar sp2 car- 
bon) and NLC-NRD (a charged nitrogen interacting with 
a hydrogen bond donor nitrogen). 

The fourth category (Fig. Id) showed a noisy curve that 
did not display an easily identifiable pattern, which may 
indicate insufficient data, as they usually contained an 
atom type of rare occurrence, such as halogen in HAL- 
NLB and metal in NLC-MET. There were 19 atom type 
pairs in this category. 



A common observation for the first three categories of 
atom type pairs was a zero occurrence when the distance 
was short enough, suggesting that a distance threshold 
could be established and that only above this could 
attraction overcome repulsion, allowing the pair to be 
observed. In addition, the occurrence distribution curves 
of the first two categories fell after reaching a peak, dis- 
playing a specific upper threshold. The third category did 
not need an upper threshold, since its atom type pairs 
preferred not to contact each other. For the fourth cate- 
gory, we decided to use 4A and 2A as the upper and lower 
threshold values, which are arbitrary choices, but are 
within the range determined for the other three catego- 
ries of interaction. For other rarely observed atoms, such 
as boron, that are not defined in our simple set of 23 atom 
types (Table 1), the arbitrarily chosen thresholds of the 
fourth category along with the associated network motifs 
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can be applied to them to facilitate subsequent computa- 
tions. 

Despite the common main features, these occurrence 
distribution curves were not smooth, making an auto- 
mated determination of thresholds difficult, so we manu- 
ally inspected the data and recorded the lower and upper 
thresholds for each of the 280 atom type pairs. In all, the 
upper thresholds thus determined ranged from 3A to 
5.6A and the lower thresholds from 1.2A to 4A. (see Table 
SI in Additional file 1). 

Construction of protein-ligand interaction networks 

With the upper and lower thresholds for all the interac- 
tions determined, we then constructed protein-ligand 
interaction networks using the following procedures: 1) 
carrying out atom type assignment on the atoms of the 
protein and ligand (hydrogen atoms were not considered 
in this work), 2) checking all the protein-ligand inter- 
atomic distances against the distance thresholds of the 
corresponding atom type pairs, and 3) connecting those 
that met the thresholds and ignoring those that did not. 
As an example, Fig. 2 shows a 2D representation of the 
3D protein-ligand interaction network constructed for 
the complex of carboxypeptidase A and 1-benzylsuccinate 
(PDB entry lebx). 

Motif searching and development of MotifScore 

The next step was to identify motifs from the total of 
6,276 protein-ligand interaction networks constructed as 
described above. While a number of algorithms for dis- 
covering network motifs have been reported [28-32], they 
could not be easily adopted here because the protein- 
ligand interaction networks constructed in this study are 
a bipartite network, which is different from the conven- 
tional gene-gene or protein-protein interaction networks, 
in which there is usually only one type of node and for 
which the existing motif identification algorithms have 
usually been developed. Moreover, since there were only 
23 atom types, a number much smaller than the number 
of interacting atoms, many nodes within the same pro- 
tein-ligand network were indistinguishable, which is 
quite different from the situation in other biological net- 
works, where a node usually does not occur more than 
once in the same network. 

In the total of 6,276 protein-ligand interaction net- 
works constructed, the largest number of protein atoms 
connected to a ligand atom was 17. Thus, a network motif 
involving just one ligand atom would be those connecting 
1 (ligand atom) to 2 (protein atoms), 1 to 3,..., up to 1 to 
17. However, mathematically, of these "1 interacting with 
n" motifs, the number of those connecting to 7 or 8 pro- 



^C2P(C1): 



CRP(HIS196-CG) 
NRE(HIS196-ND1) 
C3P(GLU270-CD) 
03B(TYR248-OH) 
CRP(HIS69-CE1) 
NLC(ARG127-NH2) 
NLB(ASN144-ND2) 
OLC(GLU72-OE2) 
C3P(ARG145-CZ) 
NLC(ARG145-NH1) 
NLC(ARG145-NH2) 
OLC(GLU270-OE1) 
OLC(GLU270-OE2) 
C3N(ILE247-CG2) 
C3N(ALA250-CB) 
C3N(THR268-CG2) 
C3P(GLY253-CA) 
C3N(ILE243-CD1) 
C3P(THR268-CB) 

• 03B(THR268-0G1) 
NRE(HIS69-ND1) 
NLC(ARG127-NH1) 
C3P(GLU72-CD) 
OLC(GLU72-OEl) 
C3N(HIS196-CB) 

• C2P(SER197-C) 
02A(SER197-0) 

• NRE(HIS69-NE2) 
C3N(ASN144-CB) 
C3P(ASN144-CG) 
CRP(HIS196-CE1) 
CRN(TYR248-CE1) 
CRP(TYR248-CZ) 



Figure 2 A protein-ligand interaction network This example 
shows the protein-ligand interaction network transformed from the 
3D coordinates of the complex of carboxypeptidase A and l-benzylsuc- 
cinate (PDB entry 1cbx). Each ligand atom in this network is represent- 
ed by a combination of a red circle, its 3-code atom type, and its atom 
name from the PDB file in the parenthesis. Likewise, each protein atom 
is represented by a combination of a blue circle, its 3-code atom type 
and, in parentheses, its residue name, residue number, and atom name 
from the PDB file. Each black line connects a ligand atom and a protein 
atom for which connectivity (interaction) has been established based 
on distance thresholds. 




tein atoms would be the largest (i.e. C] 7 and Cg 7 >> all 
other C 1 / )• m the actual data, the 1 interacting with 7 
and 1 interacting with 8 motifs indeed dominated the 1 
interacting with n motifs (data not shown), and, because 
the number of 1 interacting with 7 and 1 interacting with 
8 motifs was so huge, the other types of motifs became 



Xie and Hwang BMC Bioinformatics 201 0, 1 1 :298 
http://www.biomedcentral.eom/1 471 -2 1 05/1 1 /298 



Page 7 of 16 



negligible and a scoring function based on these would be 
severely biased to the 1 interacting with 7 and 1 interact- 
ing with 8 motifs. Consequently, we decided to search for 
other simple motifs, such as those of 2 ligand atoms inter- 
acting with 3 protein atoms, where the scoring function 

would not be severely biased to a particular type of motif. 

There are four distinctive topologies of the 2 ligand 
atoms interacting with 3 protein atoms motifs, as shown 
in Fig. 3. In all, we obtained 395,240 such interaction 
motifs, the large number arising from the numerous com- 
binations of the 23 atom types (Table 1) that could be 
assigned to the five constituent atoms. 

For various reasons, other simple motifs were not 
included in MotifScore: the 1 interacting with 2 and 2 
interacting with 1 motifs are likely to occur randomly, as 
they contain only two connections; there was a compara- 
tively very small number of 3 (ligand atoms) interacting 
with 1, 3 interacting with 2, and 3 interacting with 3 
motifs due to the small number of ligand atoms that can 
simultaneously interact with protein atoms under our 
interaction criteria; and, for the 2 interacting with 2 and 1 
interacting with 3 motifs, no significant difference in the 
performance of MotifScore was seen when they were 
included (data not shown). 

Intuitively, without the guidance of energy computa- 
tion, a network motif-based scoring function stipulates 
that the presence of more motifs implicates better inter- 
actions (i.e. better docking solutions). However, in devel- 
oping MotifScore, we immediately faced two major 
difficulties: 1) as mentioned, due to the large number of 
combinations of atom types, even the simple 2-3 network 
topologies (Fig. 3) harbored a large number of distinctive 
motifs which should not all be counted equally, because, 
for instance, some motifs consist of more frequently 
occurring atom types than others; 2) if the ligand mole- 
cule was pushed into the binding site to make more con- 
tacts with the protein, new motifs would be created and 
some of the old motifs removed, but the rate of increase 
would far outweigh the rate of loss. To overcome these 
difficulties, MotifScore was consequently made a compos- 




Figure 3 Four distinctive types of simple motifs considered in this 
work. The four have different connection topologies: 2 red nodes (li- 
gand atoms) interacting with 3 blue nodes (protein atoms) with 4, 5, or 
6 connections. The interactions are symbolized by the black lines. 



ite of two opposing components, Gain and Penalty, with 
Gain being a sum of the normalized motif counts and 
Penalty being a factor using number of clashes between 
ligand and protein atoms to avoid motif overcounting, as 
described below. 

In Eq. 1 we define, for each of the total 395,240 motifs, a 
significance grade, SG, which reflects the relative impor- 
tance of a specific motif in the network. 



SGj = log 



OMj/Nj 

5 

n {OAj/Mj) 



(1) 



where QM ( denotes the occurrence of a specific motif i 
in the 6,276 protein-ligand interaction networks con- 
structed from the complex structures in the PDB; motif i 
is one of 395,240 motifs, each of which consists of 5 
atoms, 2 on the ligand and 3 on the protein, and of 4-6 
network edges, depending on which of the four motif 
types (Fig. 3) motif i belongs to; N { is the number of 
motifs having the same motif type as motif i; OA^ denotes 
the occurrence of atom type /' (one of 14 protein atom 
types or 20 ligand atom types; see Table 1) in the 6,276 
networks, where atom type /' (J = 1,5) is the atom type 
assigned to one of the 5 atoms constituting motif i; and 
Mj is either M t or M p {M t if atom type /' is a ligand atom 
and M p if atom type /' is a protein atom), where M l (M p ) is 
the total number of ligand (protein) atoms. 

Using normalization by the total occurrence of each 
motif type and the relative occurrence of each atom type, 
Eq. 1 takes into account the fact that motifs consisting of 
fewer edges or more prevalent atom types are likely to 
occur more often. With Eq. 1, a motif's SG would not 
necessarily increase in a larger dataset of protein-ligand 
complex structures in which motif occurrences are 
bound to increase. Because each motif's SG reflects the 
probability of the motif occurring in the interaction net- 
work, the probability of it occurring simultaneously with 
other motifs should be a multiplication of all their SGs. 
For easier computation, motif SG was log transformed to 
convert multiplication into addition. 

To score a specific protein-ligand interaction network, 
i.e. a specific protein-ligand complex conformation, the 
Gain component (Eq. 2) of MotifScore is the sum of the 
motif SG (Eq. 1) for all the motifs found in this specific 
network, or the particular protein-ligand binding confor- 
mation from which the network resulted. 



Gain ■ 



SG h 



(2) 
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where SG k is the motif significance grade (Eq. 1) for the 
k th motif found in the specific network constructed from 
a specific docking solution, and n is the total number of 
motifs present in this network. 

As mentioned earlier, we need a penalty to counteract 
the excessive motif gains of a growing interaction net- 
work that may occur in docking when one places the 
ligand molecule closer and closer to its protein receptor. 



Penalty = WxNC 



(3) 



In Eq.3, NC denotes the total number of clashes 
between ligand and protein atoms, where one clash is 
counted when the distance between two atoms, one each 
from the protein and ligand, is less than the lower thresh- 
old of their atom type pair (see Fig. 1) and weight Wis a 
parameter to be optimized against a subset of LPDB, 
which served as a training set, as described above. The 
value of 10 for W appeared to yield the best result, though 
the results were not very sensitive to the value of this 
parameter (Fig. SI in Additional file 1). A second parame- 
ter was also introduced to adjust the lower distance 
thresholds of the interacting atom type pairs for best fit of 
the training data. As described earlier, the lower thresh- 
olds were determined based on the observed distribution 
of interatomic distances. However, at a distance barely 
exceeding the lower threshold, the probability of observ- 
ing the atom type pairs in the native structures is rather 
low, whereas, when evaluating the conformations pro- 
duced in the docking process, for any distance that was 
above the lower threshold, even just barely, a connection/ 
interaction would be made and it would contribute to 
motif Gain. In order to reflect this reality better, we mod- 
ified the lower distance thresholds by a factor and found 
that better results could be obtained when the lower dis- 
tance thresholds for computing the Gain component 
were raised by 10 percent compared to those used to 
compute Penalties (Fig. S2 in Additional file 1), though, 
like the Penalty weight W(Fig. SI in Additional file 1), the 
results were not very sensitive to the value of the lower 
threshold factor used (Fig. S2 in Additional file 1). 

Finally, after experimenting with several different com- 
posite formulae, we settled on one with a ratio for Motif- 
Score (Eq. 4), which yielded the best results. 



MotifScore ■ 



Gain 
Penalty+1 



(4) 



Results and Discussion 

Scoring docking solutions 

As in many related studies, we evaluated the performance 
of MotifScore using the criteria of distance rmsd (root- 
mean-square-deviation) between the experimentally 



observed ligand positions and those of the highest-scored 
docking solution to consider whether the scoring was a 
success or a failure. We used two docking datasets, one by 
Brooks et al. [8] for parameter optimization (Fig S1,S2 in 
Additional file 1) and the other by Wang et al. [9] for per- 
formance comparison with 12 other scoring functions 
(Table 2). Our success rate using the criterion of rmsd %2 
A was 84% for both the parameter optimization set (Fig. 
SI and S2 in Additional file 1) and Wang's benchmark test 
set (Table 2). As shown in Table 2, compared to other 
scoring functions, MotifScore performed admirably, being 
second only to DrugScore CSD [20]. DrugScore CSD is a sig- 
nificantly improved version of DrugScore PDB [19]. Drug- 
Score PDB was developed based on the PDB and 
DrugScore CSD was developed based on the Cambridge 
Structural Database (CSD) [33], which is a higher-resolu- 
tion data source for contact distances between interact- 
ing atoms, a difference thought to underlie the 
improvement [20]. Whether MotifScore can be similarly 
improved using the CSD instead of the PDB remains to 
be determined, but the effect may not be as significant as 
in DrugScore CSD , as it is likely that the network motif- 
based scoring (Eq. 1-4) cannot be fine-tuned as much as a 
conventional energy-based scoring function, such as that 
used by DrugScore CSD . Nevertheless, it is encouraging 
that, despite its present crude form, MotifScore per- 
formed surprisingly well and therefore can serve as a 
non-conventional alternative to existing scoring func- 
tions, being useful especially for coarse-grained docking 
computations. 

The 100-complex dataset of Wang et al. [9] had also 
been divided into subsets of hydrophilic, mixed, and 
hydrophobic complexes to evaluate potential bias of scor- 
ing functions on different types of molecular interactions. 
As summarized in Table 3, MotifScore achieved a very 
high success rate of 91% on both the hydrophilic and 
mixed subsets, but, like many other scoring functions, 
including DrugScore PDB (result for DrugScore CSD not 
available), its performance was relatively poor for the 
hydrophobic subset. The less favorable result for the 
hydrophobic subset is thought to be due to several diffi- 
cult cases in this subset in which the binding site is shal- 
low and hard to score accurately [9], compounded by the 
fact that the number (24) of complexes in this subset is 
much smaller than in the other two subsets, which means 
that the success rate would drop by as much as 4% if the 
number of correctly predicted complexes was reduced by 
only one. 

Some significant motifs 

To elucidate which of the hundreds of thousands of 
motifs make the greatest contribution to MotifScore, we 
ranked them by their motif SG. The top 30 motifs are 
listed in Table 4. All contain aromatic and/or polar 
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Table 2: Docking scoring success rates for different scoring functions. 3 
Scoring function Success rate (%) using different rmsd criteria 11 





%1 A %2 A 




%3A 


DrugScore CSD 


83 87 




*c 


MotifScore 


71 84 




86 


Cerius2/PLP 


63 76 




80 


SYBYL/F-Score 


56 74 




77 


Cerius2/LigScore 


64 74 




76 


DrugScore PDB 


63 72 




74 


Cerius2/LUDI 


43 67 




67 


X-Score 


37 66 




74 


AutoDock 


34 62 




72 


Cerius2/PMF 


40 52 




57 


SYBYL/G-Score 


24 42 




56 


SYBYL/ChemScore 


12 35 




40 


SYBYL/D-Score 


8 26 




41 


a The data were taken from Wang et al. [9] and Velec et al. [20], except for the MotifScore results, which were computed in this work using the 
same dataset. 

b Scoring functions are ranked by their success rates at rmsd %2.0 A (the success rate of a scoring function is calculated by checking if the 
rmsd value of the highest-scored conformation is less than, or equal to, the specified rmsd criterion from the experimentally observed 
conformation.) 

c Not provided by Velec et al. [20] 


(including charged) atom types on the protein side. This [34,35]. In addition, although cysteine occurs less often 
agrees well with the observation that the frequently than other amino acids in proteins, it is relatively 
observed catalytic residues of proteins are generally the enriched in catalytic residues [35]. The sulfur atom of 
residues with aromatic, polar, or charged side-chains cysteines, assigned as S3N, was also commonly observed 


Table 3: Success rates of different scoring functions on subsets of different types of molecular interactions 


Scoring function 


Success rate (%) 






Overall (100) 


Hydrophilic (44) Mixed (32) 


Hydrophobic (24) 


MotifScore 84 


91 


91 


63 


Cerius2/PLP 76 


77 


78 


71 


SYBYL/F-Score 74 


75 


75 


71 


Cerius2/LigScore 74 


77 


75 


67 


DrugScore PDB 72 


73 


81 


58 


Cerius2/LUDI 67 


75 


66 


54 


X-Score 66 


82 


59 


46 


AutoDock 62 


73 


53 


54 


Cerius2/PMF 52 


68 


44 


33 


SYBYL/G-Score 42 


55 


34 


29 


SYBYL/ChemScore 35 


32 


34 


42 


SYBYL/D-Score 26 


23 


28 


29 



The data were taken from Wang et al. [9], except for MotifScore. The success rates were determined using rmsd % 2.0 A. 
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Table 4: Top 30 protein-ligand interaction network motifs ranked by motif significance grades, SG. 

Rank Ligand atom Protein atom Occurrence Significance Motif type 

types types grade 



1 


HAL, HAL 


CRN,S3N, S3N 


31 


11.79 


5 connections 


2 


C2N,ORA 


NRE.S3N, S3N 


1 


11.69 


5 connections 


3 


HAL, HAL 


S3N, S3N, S3N 


6 


11.67 


4 connections, 2+2 


4 


C2N,ORA 


CRP, S3N, S3N 


1 


10.89 


5 connections 


5 


NRA, NRA 


CRN, CRN, NRE 


1597 


10.71 


Fully connected 


6 


HAL, HAL 


NRE, 03B, S3N 


5 


10.62 


5 connections 


7 


CRN, ORA 


CRN, CRN, CRN 


374 


10.57 


Fully connected 


8 


C2N,ORA 


S3N, S3N, NRE 


1 


10.55 


4 connections, 2+2 


9 


CRN, NLC 


CRN, CRN, CRN 


2372 


10.54 


Fully connected 


10 


CRP, NRA 


CRN, CRN, CRP 


5236 


10.49 


Fully connected 


11 


C2N,ORA 


NRE.S3N, NLB 


1 


10.40 


5 connections 


12 


NRA, NRA 


CRN, CRN, CRP 


2572 


10.40 


Fully connected 


13 


CRP, NRA 


CRN, CRP, CRP 


1227 


10.36 


Fully connected 


14 


C2N, C2N 


CRN, CRN, NRE 


66 


10.29 


Fully connected 


15 


HAL, HAL 


CRP, 03B, S3N 


8 


10.29 


5 connections 


16 


HAL, NLC 


CRN, S3N, S3N 


25 


10.28 


4 connections, 3+1 


17 


NRA, NRA 


CRN, NRE, NRE 


125 


10.28 


Fully connected 


18 


NRA, NRA 


CRN, CRP, NRE 


259 


10.22 


Fully connected 


19 


HAL, HAL 


S3N,CRN, S3N 


27 


10.20 


4 connections, 3+1 


20 


CRN, NLC 


CRN, CRN, CRP 


450 


10.20 


Fully connected 


21 


HAL, HAL 


CRN, 03B, S3N 


2 


10.17 


Fully connected 


22 


CRN, CRN 


CRN, CRN, CRN 


37501 


10.12 


Fully connected 


23 


CRP, CRP 


CRN, CRP, CRP 


1676 


10.05 


Fully connected 


24 


HAL, HAL 


S3N, NRE, S3N 


2 


10.03 


4 connections, 2+2 


25 


HAL, NLA 


S3N, S3N, NLC 


2 


10.01 


5 connections 


26 


NRA, NRA 


CRN, CRN, CRN 


6605 


10.01 


Fully connected 


27 


HAL, NLC 


CRN, S3N, NLB 


50 


10.00 


4 connections, 2+2 


28 


HAL, HAL 


CRN, S3N, S3N 


14 


9.86 


4 connections, 2+2 


29 


HAL, HAL 


CRN, CRN, CRN 


68 


9.86 


Fully connected 


30 


HAL, NLC 


CRN.S3N, NLC 


50 


9.82 


4 connections, 2+2 



The occurrences were counted in the 6,276 protein-ligand complex structures used to construct the networks (see Methods). See Fig. 3 for 
the different motif types. 



in our top-ranked motifs. Similarly, there is a relatively 
small number of halogens in protein binding ligands, but, 
when present, these atoms usually play a key role in pro- 
tein-ligand interactions. The normalized motif count, or 
SG, reflects the significance of these halogens: atom type 
HAL-containing motifs appear several times in the top 30 
motifs even though their occurrences are small. 

Fig. 4a-d illustrate four of the top 30 ranked motifs 
observed in the 6,276 protein-ligand complexes. The four 
examples all exhibit interactions between aromatic atom 
types. As can be seen, these motifs are a composite of 



multiple pairwise interactions between two ligand inter- 
acting sites (atoms) and three protein atoms from one (c 
and d), two (b), or three (a) amino acid residues. Interest- 
ingly, in Fig. 4b-d, the aromatic rings of the ligand and of 
the amino acid side-chains stack against each other, form- 
ingn-ninteractions [36]. These ring-stacking interactions 
are not easily modeled by conventional scoring functions 
using separate accounts of pair-wise interactions, but, 
collectively, they emerge as significant motifs in protein- 
ligand interaction networks. 
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Figure 4 Four significant motifs involving interactions between 
aromatic rings. Ligand molecules are shown as line model and pro- 
tein side-chains as stick model. The green lines show the connectivities 
(edges) of the motifs, which, in these examples, are a 2 (ligand atoms) 
interacting with 3 (protein atoms) fully connected motif, (a) (NRA, NRA) 
interacting with (CRN, CRN, CRP): 2 ligand atoms interacting with 3 pro- 
tein atoms located on 3 different aromatic residues; (b) (NRA, NRA) in- 
teracting with (CRN, CRP, NRE): 3 protein atoms located on 2 aromatic 
residues and the aromatic ring of histidine is parallel to that of adenine 
in the ligand; (c) (CRN, ORA) interacting with (CRN, CRN, CRN): all 3 pro- 
tein atoms located on 1 residue, tyrosine; (d) (CRN, CRN) interacting 
with (CRN, CRN, CRN): this motif is the most prevalent motif (in num- 
ber) and is ranked 22 nd (by significance grade) in Table 4. 

Binding site-enriched protein triangles 

Since all of the protein-ligand interaction motifs in our 
model contain 3 protein atoms, we wondered whether the 
chance of observing three spatially close protein atoms 
simultaneously was quite different in the ligand-binding 
site compared to the rest of the protein. To answer this 
question, we computed a binding site enrichment factor, 
F b , using Eq. 5, for every distinguishable group of three 
protein atom types captured in our 2 interacting with 3 
ligand-protein interaction motifs. For convenience, we 
called the three protein atoms a protein triangle. Note 
that, as protein triangles are distinguished by the atom 
types assigned to their three constituent atoms, they are 
atom type triangles. 



Eq. 5 was adopted here based on the work of Zeeberg 
et. al. [37], where n s and N s are the occurrences of a spe- 
cific atom type triangle found, in, respectively, the inter- 
action network and the whole protein and n and N are the 
corresponding total number of atom type triangles. To 
count the triangles not only in the binding site, but also in 
the whole protein, we searched for those with all three 



sides longer than 2 A and two less than 10 A and the 
other less than 13 A, which would account for almost all 
(>99.5%) of the triangles present in the protein-ligand 
interaction motifs identified. Theoretically, with 14 pro- 
tein atom types and taking repetition into account, we 
should have a total of 560 different atom type triangles. 
However, as metal ions rarely occur in our dataset of pro- 
tein structures and almost all of those that do are 
involved in ligand binding, the F b s for metal-containing 
triangles were extremely large. They were therefore 
excluded in the discussion below, leaving 455 non-metal 
triangles to be ranked and sorted by their F b (see Table S2 
in Additional file 1). 

The binding site enrichment factor F b represents a pro- 
tein triangle's propensity for occurring in the protein- 
ligand interaction networks constructed, or, in other 
words, the ligand-binding sites. As shown in Fig. 5, such 
propensity is far from uniform, and, in fact, the distribu- 
tion resembles that observed for a wide spectrum of bio- 
logical properties: the distribution of F b tended to follow a 
power law distribution [38], y~x Y(ywas estimated to be 
1.095, r 2 = 0.89), suggesting that only a handful of trian- 
gles were highly enriched in ligand-binding sites, while 
most showed little or no propensity [223/455 (49%) had 
an F b below 2 and 101 (22%) below 1]. Some low fre- 
quency triangles appear to be concentrated at ligand- 
binding sites. For example, the NRE-NRE-NRE triangles, 
which are triangles that connect 2 or 3 different histidines 
(there were few NREs from tryptophan in our motifs), 
were not commonly observed, but, when they were, they 
tended to be at ligand-binding sites (their F b was 17.3). 
The uneven distribution of F b (Fig. 5), which was normal- 
ized to account for the wide range of occurrences of these 
triangles (Eq. 5), suggests that F b does not necessarily cor- 
relate with binding site occurrence (Fig. 6), although tri- 
angles with a high F b tended to occur less frequently in 
other parts of the protein (Fig. 7). Many of the highly 
enriched triangles were constituents of the top-ranked 
motifs (Table 4, 5) that were major contributors to the 
MotifScore. Consequently, as can be seen from Fig. 8, tri- 
angles with higher F b values also tended to form motifs 
with a higher SG (Eq. 1); the correlation between the two 
was significant, with the Spearman correlation coefficient 
being 0.78. 

Interestingly, Table 5 shows that some of the binding- 
site enriched protein triangles consisted of 3 charged or 
polar atom types with the same positive (or partially posi- 
tive) or negative (or partially negative) charge. For exam- 
ple, the most enriched triangle consisted of 3 NLCs, 
which are nitrogen atoms with a positive charge on lysine 
or arginine residues. The reason why these positively 
charged amino acids, which are spatially close, as they 



Xie and Hwang BMC Bioinformatics 201 0, 1 1 :298 
http://www.biomedcentral.eom/1 471 -2 1 05/1 1 /298 



Page 12 of 16 



70 



60 



50 



40 



30 



20 



10 




1 2 
log (Rank of Triangles) 



101 



201 301 
Triangles (Ranked by Fb) 



401 



Figure 5 Uneven distribution of the binding-site enrichment factors (F b ) for 455 protein atom type triangles. The insert shows the fitting of 
the distribution to a power-law expression, y~x~v, withyestimated to be 1 .095 (r 2 = 0.89) for the best fit. 



form the triangle, are enriched in the active site is proba- 
bly because they all interact with the ligand molecule at 
the same time. 



As shown in Table 5, the top 30 binding site-enriched 
triangles were generally composed of atom types (namely, 
CRP, NRE, NLC, NLB, OLC, 03B, and S3N) that exist on 
the side-chains of the amino acids histidine, lysine, argin- 




101 201 301 

Triangles (Ranked by Fb) 



Figure 6 Occurrences of protein atom type triangles in protein-li- 
gand interaction networks (i.e. ligand binding sites) do not corre- 
late well with binding site enrichment factors The protein atom 
type triangles are ranked by the binding site enrichment factor F b . 
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Figure 7 Occurrences of protein atom type triangles in the whole 
protein show trend with binding site enrichment factors. The pro- 
tein atom type triangles are ranked by the binding site enrichment fac- 
tor F h . 
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Figure 8 Binding site enrichment versus significant grade. The 

scatterplot shows the relationship between the binding site enrich- 
ment factor F b for protein triangles versus the average significance 
grade (SG, Eq. 1) for the triangle-containing motifs. The correlation be- 
tween the two has a Spearman correlation coefficient value of 0.78. 

v J 

ine, asparagine, aspartic acid, glutamic acid, glutamine, 
serine, tyrosine, threonine, and cysteine. A survey by Bar- 
lett et. al. [34] found that, of the 20 amino acids, only 
these 11 polar and charged residues are directly involved 
in catalytic reactions. Gutteridege and Thornton [35] fur- 
ther proposed that histidine is the most commonly 
observed and most important residue in protein enzy- 
matic reactions, followed by, in descending order of 
observance, cysteine, the charged residues glutamate, 
aspartate, arginine, and lysine, and, finally, the polar resi- 
dues serine, threonine, tyrosine, glutamine, and asparag- 
ine. This is fairly consistent with our statistics on the 
ligand-binding site enrichment: Of the 30 protein trian- 
gles that were most enriched in ligand-binding sites, CRP 
(on residues with polar and aromatic side chain, including 
histidine) was the most commonly observed atom type, 
while NLC (on lysine or arginine), NRE (primarily on his- 
tidine), and OLC (on aspartic acid or glutamic acid) were 
also frequently observed, and NLB (on glutamine or 
asparagines) and 03B (on serine, threonine, or tyrosine) 
on polar residues were observed less often. Unlike in 
Table 4, S3N (on cysteine) appeared only once in Table 5. 
This can be explained by the fact that whereas every pro- 
tein triangle in Table 5 is unique, the same triangle can 
appear multiple times in Table 4 together with different 
interacting ligand atoms and/or in different motif topolo- 
gies (Fig. 3). 

Many of the binding site-enriched triangles are com- 
posites of multiple residues. For example, many of the 
NRE-NRE-OLC triangles were composed of two histi- 
dines and a carboxylate residue, corresponding to the 2- 
His-l-carboxylate facial triad, a characteristic motif of a 
metalloenzyme superfamily [39]. Similarly, many of the 
NRE-NRE-NRE triangles were composed of three histi- 



dines, known as the histidine triad of the nucleotide- 
binding histidine triad superfamily [40,41]. Interestingly, 
while only 28 and 2 protein structures were annotated by 
PDB [23] as having, respectively, a histidine-triad or a 2- 
His-l-carboxylate facial triad in the 6,276 protein-ligand 
complex structures we analyzed, roughly 2% and 5% of 
these proteins contain NRE-NRE-NRE (on 3 different 
histidines) and NRE-NRE-OLC (on 2 histidines and an 
aspartic/glutamic acid) atom type triangles in their 
ligand-binding sites, respectively. Although these pro- 
teins may not be members of the histidine triad super- 
family or the 2-His-l -carboxylate facial triad superfamily 
per se, the much higher prevalence than previously rec- 
ognized of these motifs may guide future investigations to 
identify conserved catalytic mechanisms in diverse 
enzyme families. 

Issues and prospects of MotifScore 

Although MotifScore has been derived using static crys- 
tallographic structures of protein-ligand complexes, the 
use of non-precise interaction distances and counts of 
number of motifs instead of interaction energies likely 
renders it more tolerant to subtle conformational changes 
in protein or ligand than are conventional energy-based 
scoring functions, which are known to be sensitive to 
steric clashes especially in unbound dockings [7,42]. The 
results on both the LPDB and Wang's dataset attest the 
ability of MotifScore to account for at least ligand flexibil- 
ities, since the ligands, though not the proteins, in the 
decoys of both sets include flexible conformations gener- 
ated by molecular dynamics simulation and genetic algo- 
rithm, respectively. Nevertheless, the performance of 
MotifScore in dockings that sample different protein con- 
formations (e.g. [42]) requires further examinations. 

MotifScore can be regarded as a kind of knowledge- 
based scoring function since it was also derived by 
extracting information from a statistical analysis of 
known protein-ligand complex structures. However, 
MotifScore is different from conventional knowledge- 
based scoring functions such as PMF and DrugScore [18- 
20] in at least two aspects: 1) it is non-energy based, and 
2) using the interaction network motifs, it can score 
directly on the 3D interaction patterns of molecular rec- 
ognition conserved in protein ligand complexes. One dis- 
advantage of MotifScore, however, is that a non- 
conventional search scheme may need to be developed to 
take advantages of its unique features, as discussed next. 

As MotifScore is non-energy-based, it was of interest to 
examine the landscape of its functional values. We found 
that MotifScore did not correlate well with experimentally 
determined binding affinities, the Spearmen correlation 
coefficient between the two being only 0.259, only better 
than that (0.141) of AutoDock among the various scoring 
functions evaluated by Wang et al. [9]. Nevertheless, Fig. 
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Table 5: Top 30 protein atom type triangles ranked by their binding site enrichment factor F b . 



Rank 


Triangle 


F b 


Coverage%* (binding site) 


Coverage 0 /)* (whole protein) 


1 


NLC-NLC-NLC 


70.6 


13.2 


93.7 


2 


CRP-CRP-CRP 


61.9 


23.3 


96.6 


3 


CRP-NLC-NLC 


47.2 


16.6 


98.0 


4 


CRP-CRP-NLC 


44.3 


20.1 


97.7 


5 


CRP-CRP-OLC 


38.1 


22.6 


98.3 


6 


NLC-NLC-NRE 


32.8 


8.2 


95.7 


7 


CRP-CRP-NRE 


31.7 


16.8 


93.6 


8 


CRP-CRP-03B 


29.8 


26.4 


98.2 


9 


NLC-NLC-03B 


29.6 


16.3 


98.6 


10 


CRP-CRP-NLB 


26.2 


19.8 


98.1 


11 


CRN-CRP-CRP 


26.2 


32.4 


98.5 


12 


CRP-OLC-OLC 


25.8 


21.0 


98.8 


13 


NLB-NLC-NLC 


24.6 


20.3 


98.9 


14 


CRP-NRE-OLC 


23.5 


15.4 


97.6 


15 


CRP-NLC-03B 


23.5 


19.1 


98.3 


16 


CRP-NLC-NRE 


23.0 


12.1 


96.9 


17 


NRE-OLC-OLC 


19.9 


12.5 


97.9 


18 


CRP-NRE-NRE 


19.7 


9.0 


89.4 


19 


CRP-NLB-NLC 


19.2 


17.2 


98.3 


20 


NLC-03B-03B 


18.6 


14.2 


98.5 


21 


NRE-NRE-OLC 


18.1 


8.3 


96.7 


22 


CRN-NLC-NLC 


17.5 


17.4 


98.6 


23 


NRE-NRE-NRE 


17.3 


4.2 


77.3 


24 


OLC-OLC-OLC 


17.0 


13.0 


98.4 


25 


CRP-NLC-OLC 


17.0 


15.9 


98.3 


26 


CRN-CRN-CRN 


16.9 


41.4 


99.0 


27 


CRP-CRP-S3N 


16.8 


5.3 


93.6 


28 


NLC-NRE-NRE 


16.7 


6.1 


94.2 


29 


CRN-CRN-CRP 


16.5 


37.6 


98.9 


30 


C3P-NLC-NLC 


16.2 


33.5 


98.9 



*Coverage: The percentage of protein-ligand complexes (out of the 6,276 complexes) in which a specific triangle was present in the ligand- 
binding site or the whole protein. 



9 shows that, for a typical success case, MotifScore could 
easily distinguish reasonably good docking solutions 
(rmsd < 2 A) from bad ones. Intriguingly, there appears to 
be a very narrow funnel leading to the native state formed 
by very good docking solutions. This is quite distinct 
from that of a conventional energy-based scoring func- 
tion where the funnel leading to the energy minimum, 
which should be reasonably close to the native state, is 
usually much smoother [43]. This implies that, whereas a 
search algorithm, such as a genetic algorithm, may work 
efficiently with a conventional, energy-based scoring 
function to find good docking solutions, its direct adop- 



tion for use with MotifScore would not be ideal. On the 
other hand, the protein-ligand interaction motifs derived 
in this work, in their three dimensional arrangements of 
interacting atoms, capture the spatial arrangements of 
reasonably good docking solutions, so it is quite possible 
to develop a scheme to look up the table of interaction 
network motifs (e.g. Table 4) and directly home in on a 
reasonable docking solution based on the structure of a 
few top-ranked motifs. This could eliminate the time- 
consuming searching computation needed in conven- 
tional docking methods, or at least provide a good start- 
ing solution for further refinement. In addition, although 
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Figure 9 The lanscape of Motif Score. The scatterplot shows the 
landscape of MotifScore values for a typical success case (PDB code 
1 APB). The 100 docking solutions (decoys), marked by triangles, were 
taken from Wang's dataset and the square represents the native com- 
plex conformation. 

MotifScore is currently limited to scoring the interactions 
between protein and small molecules, the same network- 
based approach should be extendable to protein-protein 
and protein-DNA/RNA interactions, which form con- 
served spatial and chemical binding patterns (e.g., 
[44,45]). Work along these lines is currently undergoing 
in our laboratory. 

Conclusions 

MotifScore is a novel interaction-motif-based scoring 
function for protein-ligand docking. Despite the absence 
of mathematical models to mimic the force field of 
molecular interactions, MotifScore performed well in dis- 
tinguishing between good and bad docking solutions in a 
benchmark test set. Furthermore, owing to the network 
approach, MotifScore is intrinsically more able than con- 
ventional docking scoring functions to capture interac- 
tions involving more than two interacting sites, the tt-tt 
stacking of two aromatic rings being a prime example. 
The ligand-binding site-enriched interaction motifs iden- 
tified are in accord with existing knowledge on protein- 
ligand binding and may prove useful for binding site pre- 
dictions. Finally, the three-dimensional protein-ligand 
interacting motifs could provide very good templates for 
placing ligand molecules in fast, though coarse, protein 
docking computations. 

Availability and requirements 

The source code of MotifScore is available for download 
at: https://sourceforge.net/proiects/msdock/ . The pack- 
age contains a set of Perl scripts for computing the scor- 
ing function and for creating a Perl database file that 



stores the names of ligands and their atom types. It also 
offers some demonstration examples of how to obtain the 
final docking scores. These scripts work on a Unix or 
Linux platform and their download is free for academic 
users. 

Additional material 

Additional file 1 Supplementary data for details of motif related 
parameters. This file contains 2 tables and 3 figures. Table SI lists the lower 
and upper distance thresholds for each of the 280 atom type pairs. Table S2 
lists the binding site enrichment factor, Fb, of each of the 455 non-metal 
protein atom type triangles. Figs. SI and S2 display the optimization results 
of two parameters, Penalty weight Wand the raising percentage of lower 
thresholds, on the training dataset LPDB. Fig. S3 shows a schematic presen- 
tation for some of the atom types defined in Table 1 . 
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